1 Introduction
The Fisher information matrix (FIM) is a key quantity in statistics: it is required, for example, for evaluating the asymptotic precision of parameter estimates, for building optimality criteria in experimental design, and for computing Wald test statistics or classical asymptotic distributions in statistical testing (Van der Vaart 2000). It also appears more recently in post model selection inference (Charkhi A. and Claeskens G. 2018), in the asymptotic distribution of likelihood ratio test statistics when testing variance components in mixed models (Baey C., Cournède P.-H., and Kuhn E. 2019), and as a particular Riemannian metric on complex manifolds (Le Brigant A., Preston S. C., and Puechmorel S. 2021). However, its exact computation is often not trivial. This is in particular the case in many latent variable models, also called incomplete data models, due to the presence of the unobserved variables. These models are nevertheless increasingly used in many fields of application, such as ecophysiology (Technow F. et al. 2015), genomics (Picard F. et al. 2007) and ecology (Gloaguen P. et al. 2014). They allow a better account of the different sources of variability and, when appropriate, a more precise characterization of the known mechanisms at the origin of the data. When the FIM cannot be computed exactly, one can either approximate it numerically, for example by Monte Carlo techniques as developed in the R package MIXFIM (Riviere-Jourdan M.-K. and Mentre F. 2018), or focus on an estimate of the FIM. Probably the most widely used estimate is the observed FIM (Efron B. and Hinkley D. V. 1978). When it cannot be directly computed in latent variable models, several methods have been proposed to approximate it. Among the most frequently used approaches are Monte Carlo methods and iterative algorithms derived from the missing information principle (Orchard T. and Woodbury M. A. 1972).
Indeed, according to this principle, the observed Fisher information matrix can be expressed as the difference between two matrices corresponding to the complete information and the missing information due to the unobserved variables (see e.g. (McLachlan G.-J. and Krishnan T. 2008), chapter 4). This enables the development of alternative methods to compute the observed FIM: Louis's method (Louis T. A. 1982), combined with a Monte Carlo method or a stochastic approximation algorithm by (Delyon B., Lavielle M., and Moulines E. 1999), the Oakes method (Oakes D. 1999) or the supplemented Expectation Maximization algorithm (Meng X.-L. and Rubin D. B. 1991). However, as the observed FIM involves the second derivatives of the observed log-likelihood, all these methods require computing second derivatives of the complete data log-likelihood, which leads to some disadvantages from a computational point of view. More recently, (Meng L. and Spall J. C. 2017) proposed an accelerated algorithm based on numerical first-order derivatives of the conditional expectation of the log-likelihood. Another estimate is the empirical Fisher information matrix, defined as the moment estimate of the covariance matrix of the score. It is much less used than the observed Fisher information matrix. However, it has a nice property since it is positive definite, which is not systematically the case for the latter, and it is numerically more attractive because it only requires the first derivatives of the log-likelihood.
In this paper, our contribution consists in presenting a new numerical method to evaluate the empirical FIM in latent variable models. When the proposed estimate cannot be evaluated analytically, we propose a stochastic approximation algorithm to compute it, which provides this estimate of the FIM as a by-product of the model parameter estimates.
The paper is organized as follows. In Section 2, we recall the three main FIM estimates and discuss their immediate properties. In Section 3, we give practical tools for the computation of the empirical Fisher information matrix in incomplete data models. In particular, we introduce a new stochastic approximation procedure based only on the first derivatives of the complete log-likelihood and state its asymptotic properties. In Section 4, we illustrate the finite sample size properties of both estimators and the convergence properties of the computation algorithm through simulations. The paper ends with a discussion.
2 Moment estimates of the Fisher information matrix
Let us consider a random variable Y. Assume Y admits a density g with respect to a given measure depending on some parameter \theta taking values in an open subset \Theta of \mathbb{R}^{d}, such that the log-likelihood function \log g is differentiable on \Theta and that \|\partial_\theta \log g(y;\theta) (\partial_\theta \log g(y;\theta))^t\| is integrable, where x^t stands for the transpose of a vector or a matrix x. Then, by definition, the Fisher information matrix is given for all \theta\in\Theta by: I(\theta) = E_\theta\left[\partial_\theta \log g(Y;\theta) (\partial_\theta \log g(Y;\theta))^t \right]. \tag{1} When this expression cannot be analytically evaluated, one is interested in computing an estimate of the Fisher information matrix. Considering this expression, one can derive a first moment estimator of the Fisher information matrix based on an n-sample (y_1, \ldots, y_{n}) of observations: I_{n,sco}(\theta) = \frac{1}{n} \sum_{i=1}^n \partial_\theta \log g(y_i;\theta) (\partial_\theta \log g(y_i;\theta))^t. This estimate is equal to the mean of the Gram matrices of the individual scores. One can also derive a second estimate from Equation 1 defined as I_{n,cov}(\theta) = \frac{1}{n} \sum_{i=1}^n \partial_\theta \log g(y_i;\theta) (\partial_\theta \log g(y_i;\theta))^t-\bar{s}\bar{s}^t, where \bar{s}=\frac{1}{n}\sum_{i=1}^n \partial_\theta \log g(y_i;\theta) (see e.g. (Scott W.-A. 2002)). We emphasize here that the terminology “empirical Fisher information matrix” is used in the literature for both estimates (see e.g. (Kunstner F., Hennig P., and Balles L. 2019)).
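As a simple illustration, consider a fully observed i.i.d. Poisson sample, for which the score and the information are explicit (\partial_\lambda \log g(y;\lambda)=y/\lambda-1 and I(\lambda)=1/\lambda); the two moment estimates can then be computed directly. The sketch below is purely illustrative and is not part of the models studied in Section 4:

```r
## Moment estimates I_n,sco and I_n,cov in a fully observed Poisson model
## (illustrative sketch: score = y/lambda - 1, true information I(lambda) = 1/lambda)
set.seed(1)
n <- 10000
lambda <- 2
y <- rpois(n, lambda)

scores <- y / lambda - 1             # individual scores (scalar parameter here)
I_sco <- mean(scores^2)              # mean of the Gram matrices of the scores
I_cov <- I_sco - mean(scores)^2      # centered version

c(I_sco = I_sco, I_cov = I_cov, I_true = 1 / lambda)
```

Both estimates are close to I(\lambda)=1/\lambda=0.5 for n large.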
If the log-likelihood function \log g is twice differentiable on \Theta, the following relation also holds for all \theta \in \Theta: I(\theta) = - E_\theta\left[\partial_\theta^2 \log g(Y;\theta) \right]. \tag{2}
Considering this second expression, we can derive another moment estimator of the Fisher information matrix based on an n-sample (y_1, \ldots, y_{n}) of observations, called the observed Fisher information matrix, defined as: I_{n,obs}(\theta) = - \frac{1}{n} \sum_{i=1}^n \partial_\theta^2 \log g(y_i;\theta).
Remark. We emphasize that the estimate I_{n,sco}(\theta) is always positive definite, since it is a mean of Gram matrices, contrary to the other estimates I_{n,obs}(\theta) and I_{n,cov}(\theta).
Remark. The asymptotic properties of the estimates I_{n,sco}(\theta) and I_{n,obs}(\theta) are straightforward when considering an independent and identically distributed sample (y_1, \ldots, y_{n}). In particular, assuming standard regularity conditions on g, it follows directly from the central limit theorem that I_{n,sco}(\theta) and I_{n,obs}(\theta) are asymptotically normal.
Remark. Since both estimators I_{n,sco}(\theta) and I_{n,obs}(\theta) are moment estimates of I(\theta), they are unbiased for all \theta \in \Theta. This is not the case for I_{n,cov}(\theta). Regarding the variance, neither estimator dominates the other. This can be highlighted through the following examples. Consider first a Gaussian sample with unknown mean and fixed variance. Then the variance of the estimator I_{n,obs}(\theta) is zero, whereas the variance of the estimator I_{n,sco}(\theta) is positive. Consider second a centered Gaussian sample with unknown variance. Then the variance of I_{n,sco}(\theta) is smaller than the variance of I_{n,obs}(\theta). Therefore, neither estimator is more suitable than the other in general from this point of view.
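The first of these two examples is easy to check numerically. In the sketch below (with arbitrary illustrative values), I_{n,obs} is constant over Monte Carlo replications while I_{n,sco} fluctuates around 1/\sigma^2:

```r
## Gaussian sample with unknown mean mu and fixed variance s2:
## the score is (y - mu)/s2 and the second derivative is -1/s2, so that
## I_n,obs = 1/s2 is deterministic while I_n,sco has positive variance.
set.seed(1)
M <- 200; n <- 50; mu <- 0; s2 <- 4
I_sco <- replicate(M, {
  y <- rnorm(n, mu, sqrt(s2))
  mean(((y - mu) / s2)^2)           # mean of squared scores
})
I_obs <- rep(1 / s2, M)             # -(1/n) * sum of second derivatives
c(var_I_obs = var(I_obs), var_I_sco = var(I_sco))
```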
If the variables Y_1, \ldots, Y_{n} are not identically distributed, for example when they depend on individual covariates, which is often the case in practice, we state the following result under the less restrictive assumption of independent non identically distributed random variables.
Proposition 1 Assume that Y_1, \ldots, Y_{n} are independent non identically distributed random variables, each having a parametric probability density function g_i depending on some common parameter \theta in an open subset \Theta of \mathbb{R}^{d}. Assume also that for all i the function \log g_i is differentiable in \theta on \Theta and that for all \theta \in \Theta, \partial_\theta \log g_i(y;\theta) (\partial_\theta \log g_i(y;\theta))^t is integrable. Moreover assume that for all \theta in \Theta, \lim \frac1{n} \displaystyle\sum_{i=1}^n E_\theta(\partial_\theta \log g_i(y;\theta) (\partial_\theta \log g_i(y;\theta))^t) exists and denote it by \nu(\theta). Then, for all \theta \in \Theta, the estimator I_{n,sco}(\theta) is well defined, converges almost surely toward \nu(\theta) and is asymptotically normal. Assuming additionally that \log g_i is twice differentiable in \theta on \Theta, the estimator I_{n,obs}(\theta) is well defined, converges almost surely toward \nu(\theta) and is asymptotically normal.
Proof. We prove the consistency by applying the law of large numbers for independent non identically distributed variables. We establish the normality result by using characteristic functions. By recentering the terms \partial_\theta \log g_i(y_i;\theta) (\partial_\theta \log g_i(y_i;\theta))^t around their expectations, we can assume that \nu(\theta) equals zero. Let us denote by \phi_Z the characteristic function of a random variable Z. Since characteristic functions are bounded by one in modulus, we have for all real t in a neighborhood of zero that: \begin{split} \| \phi_{\sqrt{n}\, I_{n,sco}(\theta)}(t)-1 \| & = \| \prod_{i=1}^n \phi_{\partial_\theta \log g_i(y_i;\theta) (\partial_\theta \log g_i(y_i;\theta))^t}(t/\sqrt{n}) - 1 \| \\ & \leq \sum_{i=1}^n \| \phi_{\partial_\theta \log g_i(y_i;\theta) (\partial_\theta \log g_i(y_i;\theta))^t}(t/\sqrt{n})-1 \| \end{split} Computing a Taylor expansion in t around zero, we get the result.
Noting that for all 1 \leq i \leq n, E_\theta(\partial_\theta \log g_i(y;\theta) (\partial_\theta \log g_i(y;\theta))^t)=-E_\theta(\partial_\theta^2 \log g_i(y;\theta)), we get the corresponding results for the estimator I_{n,obs}(\theta).
Remark. The additional assumptions required when considering non identically distributed random variables are in the same spirit as those usually used in the literature; see for example (Nie L. 2006), (Silvapulle M. J. and Sen P. K. 2011) or (Baey C., Cournède P.-H., and Kuhn E. 2019).
3 Computing the estimator I_{n,sco}(\theta) in latent variable models
Let us consider independent random variables Y_1, \ldots, Y_{n}. Assume in the sequel that there exist independent random variables Z_1, \ldots, Z_{n} such that for each 1 \leq i \leq n, the random vector (Y_i,Z_i) admits a parametric probability density function denoted by f parametrized by \theta \in \Theta. We present in this section dedicated tools to compute the estimator I_{n,sco}(\theta) in latent variable models when it cannot be evaluated analytically.
3.1 Analytical expressions in latent variable models
In latent variable models, the estimator I_{n,sco}(\theta) can be expressed using the conditional expectation as stated in the following proposition.
Proposition 2 Assume that for all \theta \in\Theta the function \log g(\cdot;\theta) is integrable, that for all y the function \log g(y;\cdot) is differentiable on \Theta and that there exists an integrable function m such that for all \theta \in\Theta, \| \partial_\theta \log g(y;\theta) \| \leq m(y). Then for all \theta \in \Theta and all n \in \mathbb{N}^*: I_{n,sco}(\theta) = \frac{1}{n} \sum_{i=1}^n \mathrm{E}_{Z_i|Y_i;\theta} (\partial_\theta \log f(Y_i,Z_i;\theta) ) \mathrm{E}_{Z_i|Y_i;\theta} (\partial_\theta \log f(Y_i,Z_i;\theta) )^t, \tag{3}
where \mathrm{E}_{Z|Y;\theta} denotes the expectation under the law of Z conditionally to Y.
We apply the classical Fisher identity (Fisher R.A. 1925) to establish the equality stated in Proposition 2. This statement is indeed in the same spirit as the well-known Louis formula for the observed Fisher information matrix estimate (Louis T. A. 1982).
Remark. In some specific cases, the conditional expectations involved in the previous proposition admit exact analytical expressions, for example in the mixture models developed in the simulation studies of Section 4.
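For instance, in a Poisson mixture the conditional law of Z_i given Y_i is an explicit discrete distribution, so the conditional expectations of Equation 3 are finite sums over the components. A minimal sketch (the helper below is illustrative and distinct from the code used in Section 4):

```r
## I_n,sco via Equation 3 in a K-component Poisson mixture:
## the posterior weights of Z_i given Y_i make the conditional
## expectations of the complete score explicit (illustrative sketch).
isco_poisson_mixture_exact <- function(y, lambda, alpha) {
  K <- length(lambda)
  alphaK <- c(alpha, 1 - sum(alpha))      # all K mixture weights
  d <- K + (K - 1)                        # theta = (lambda_1..K, alpha_1..K-1)
  I <- matrix(0, d, d)
  for (i in seq_along(y)) {
    w <- alphaK * dpois(y[i], lambda)     # posterior weights of Z_i given Y_i
    w <- w / sum(w)
    Di <- c(w * (y[i] / lambda - 1),      # E[d/dlambda_k log f | y_i]
            w[1:(K - 1)] / alphaK[1:(K - 1)] - w[K] / alphaK[K])  # E[d/dalpha_k | y_i]
    I <- I + Di %*% t(Di)                 # Gram matrix of the conditional score
  }
  I / length(y)
}
set.seed(1)
z <- sample(1:3, 200, replace = TRUE, prob = c(0.3, 0.5, 0.2))
y <- rpois(200, c(2, 5, 9)[z])
isco_poisson_mixture_exact(y, c(2, 5, 9), c(0.3, 0.5))
```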
3.2 Computing I_{n,sco}(\theta) using a stochastic approximation algorithm
When exact computation of the estimator I_{n,sco}(\theta) is not possible, we propose to evaluate it using a stochastic algorithm which provides the estimate I_{n,sco}(\theta) as a by-product of the estimation of the model parameter \theta.
3.2.1 Description of the algorithm in curved exponential family models
We consider an extension of the stochastic approximation Expectation Maximization (SAEM) algorithm proposed by (Delyon B., Lavielle M., and Moulines E. 1999), which computes the maximum likelihood estimate in general latent variable models. We assume that the complete log-likelihood belongs to the curved exponential family when stating the theoretical results. However, our algorithm can easily be extended to general latent variable models (see Section 3.2.3). As our estimate involves individual conditional expectations, we have to consider an extended form of sufficient statistics for the model. Therefore we introduce the following notations and assumptions.
The individual complete data likelihood function is given for all 1 \leq i \leq n by: f_i(z_i;\theta) = \exp\left(-\psi_i(\theta) + \left<S_i(z_i),\phi_i(\theta)\right>\right), \tag{4} where \left<\cdot,\cdot\right> denotes the scalar product, S_i is a function on \mathbb{R}^{d_i} taking its values in a subset \mathcal{S}_i of \mathbb{R}^{m_i}, and \psi_i and \phi_i are functions defined on \Theta.
Let us denote for all 1 \leq i \leq n by L_i the function defined on \mathcal{S}_i \times \Theta by L_i(s_i; \theta)\triangleq - \psi_i(\theta) + \left<s_i,\phi_i(\theta)\right> and by L: \mathcal{S} \times \Theta \to \mathbb{R} the function defined as L(s,\theta)=\sum_i L_i(s_i; \theta) with \mathcal{S}=\prod_i \mathcal{S}_i and s=(s_1,\ldots,s_n). For the sake of simplicity, we omit here all dependency on the observations (y_i)_{1 \leq i \leq n} since the considered stochasticity relies on the latent variables.
Finally let us denote by (\gamma_k)_{k \geq 1} a sequence of positive step sizes.
Moreover we assume that there exists a function \widehat{\theta} : \ \mathcal{S} \rightarrow \Theta, such that \forall s \in \mathcal{S}, \ \ \forall \theta \in \Theta, \ \ L(s; \widehat{\theta}(s))\geq L(s; \theta).
Initialization step: initialize s_i^0 for all 1 \leq i \leq n and \theta_0 arbitrarily.
Repeat until convergence the three steps defined at iteration k by:
Simulation step: for all 1 \leq i \leq n, simulate a realization Z_i^k from the conditional distribution of Z_i given the observations Y_i, denoted by p_i, using the current parameter value \theta_{k-1}.
Stochastic approximation step: for all 1 \leq i \leq n, compute s_i^{k} = (1-\gamma_k)s_i^{k-1} +\gamma_k S_i(Z_i^k), where the step sizes satisfy \sum_k \gamma_k=\infty and \sum_k \gamma_k^2 < \infty.
Maximisation step: update of the parameter estimator according to: \theta_{k} = \argmax_{\theta} \sum_{i=1}^n \left( -\psi_i(\theta) + \left<s_i^k,\phi_i(\theta)\right>\right) = \hat{\theta}(s^{k})
When convergence is reached, say at iteration K of the algorithm, evaluate the FIM estimator according to: I_{n,sco}^K = \frac{1}{n} \sum_{i=1}^n \hat{\Delta}_i\left(\hat{\theta}\left(s^{K}\right)\right) \hat{\Delta}_i\left(\hat{\theta}\left(s^{K}\right)\right)^t where \hat{\Delta}_i(\hat{\theta}(s)) = -\partial \psi_i(\hat{\theta}(s)) + \left<s_i,\partial \phi_i(\hat{\theta}(s))\right> for all s.
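In the linear mixed effects model y_{ij} = \beta + z_i + \varepsilon_{ij} used in Section 4, all the steps above are explicit: the conditional law of z_i given y_i is Gaussian, one can take S_i(z_i)=(z_i, z_i^2) as sufficient statistics, and \hat{\theta} is in closed form. The sketch below (simulated data, arbitrary initialization and step-size schedule) illustrates the algorithm and its FIM by-product; it is an illustrative implementation, not the code used for the simulation study:

```r
## SAEM with the FIM estimate as a by-product, in the linear mixed model
## y_ij = beta + z_i + eps_ij (exact simulation from z_i | y_i is available)
set.seed(1)
n <- 100; J <- 12
y <- matrix(rnorm(n * J, 3 + rep(rnorm(n, 0, sqrt(2)), J), sqrt(5)), n, J)

th <- c(beta = 1, eta2 = 1, sig2 = 1)       # arbitrary initialization
s1 <- rep(0, n); s2 <- rep(0, n)            # approximations of E[z_i|y_i], E[z_i^2|y_i]
for (k in 1:2000) {
  gam <- if (k <= 500) 1 else 1 / (k - 500) # sum gam = Inf, sum gam^2 < Inf
  ## Simulation step: exact draw from z_i | y_i at the current parameter
  v <- th["eta2"] * th["sig2"] / (th["sig2"] + J * th["eta2"])
  m <- J * th["eta2"] / (th["sig2"] + J * th["eta2"]) * (rowMeans(y) - th["beta"])
  z <- rnorm(n, m, sqrt(v))
  ## Stochastic approximation step on the sufficient statistics
  s1 <- (1 - gam) * s1 + gam * z
  s2 <- (1 - gam) * s2 + gam * z^2
  ## Maximisation step (closed form given the statistics)
  th["beta"] <- mean(y) - mean(s1)
  th["eta2"] <- mean(s2)
  R <- rowSums(y - th["beta"])              # per-individual residual sums
  th["sig2"] <- mean(rowSums((y - th["beta"])^2) - 2 * s1 * R + J * s2) / J
}
## FIM by-product: individual conditional scores from (s1, s2) and theta_hat
R <- rowSums(y - th["beta"])
Dbeta <- (R - J * s1) / th["sig2"]
Deta2 <- -1 / (2 * th["eta2"]) + s2 / (2 * th["eta2"]^2)
Dsig2 <- -J / (2 * th["sig2"]) +
  (rowSums((y - th["beta"])^2) - 2 * s1 * R + J * s2) / (2 * th["sig2"]^2)
I_sco <- crossprod(cbind(Dbeta, Deta2, Dsig2)) / n
th
I_sco
```

With n = 100 individuals, the parameter estimates land close to the simulation values (3, 2, 5) and the resulting matrix is positive definite, as expected for a mean of Gram matrices.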
Remark. In cases where the latent variables cannot be simulated from the conditional distribution, one can apply the extension coupling the stochastic algorithm with a Markov chain Monte Carlo procedure, as presented in (Kuhn E. and Lavielle M. 2004). All the following results can be extended to this case.
3.2.2 Theoretical convergence properties
In addition to the exponential family assumption for each individual likelihood, we also make the same type of regularity assumptions as those presented in (Delyon B., Lavielle M., and Moulines E. 1999) at each individual level. These assumptions are detailed in the appendix section.
Theorem 1 Assume that (M1') and (M2'), (M3) to (M5) and (SAEM1) to (SAEM4) are fulfilled. Assume also that with probability 1 \mathrm{clos}(\{s_k\}_{k \geq 1}) is a compact subset of \mathcal{S}. Let us define \mathcal{L}=\{\theta \in\Theta, \partial_\theta l(y;\theta)=0\} the set of stationary points of the observed log-likelihood l. Then, for all \theta_0 \in \Theta, for fixed n \in \mathbb{N}^*, we get: \lim_k d(\theta_k,\mathcal{L})=0 and \lim_k d(I_{n,sco}^k,\mathcal{I})=0 where \mathcal{I}=\{I(\theta), \theta \in \mathcal{L}\}.
Proof. Let us denote by S(Z)=(S_1(Z_1),\ldots,S_n(Z_n)) the sufficient statistics of the model we consider in our approach. Note that, as recalled in (Delyon B., Lavielle M., and Moulines E. 1999), these statistics are not unique. Let us also define H(Z,s)=S(Z)-s and h(s)=\mathrm{E}_{Z|Y;\theta}(S(Z))-s. Assumptions (M1') and (M2') imply that assumptions (M1) and (M2) of Theorem 5 of (Delyon B., Lavielle M., and Moulines E. 1999) are fulfilled. Indeed these assumptions focus on expressions and regularity properties of the individual likelihood functions and the corresponding sufficient statistics for each index i \in \{1,\ldots,n\}. Then, by linearity of the log-likelihood function and of the stochastic approximation, applying Theorem 5 of (Delyon B., Lavielle M., and Moulines E. 1999), we get that \lim_k d(\theta_k,\mathcal{L})=0. Moreover we get that for 1 \leq i \leq n, each sequence (s_i^k) converges almost surely toward \mathrm{E}_{Z_i|Y_i;\theta} (S_i(Z_i) ). Since assumption (M2') ensures that for all 1 \leq i \leq n the functions \psi_i and \phi_i are twice continuously differentiable and assumption (M5) ensures that the function \hat{\theta} is continuously differentiable, the function \Phi_n defined by \Phi_n(s^{k})=\frac1{n}\sum_{i=1}^n \hat{\Delta}_i(\hat{\theta}(s^{k}))\hat{\Delta}_i(\hat{\theta}(s^{k}))^t is continuous. Therefore we get that \lim_k d(I_{n,sco}^k,\mathcal{I})=0.
We now establish the asymptotic normality of the estimate \bar{I}_{n,sco}^k defined as \bar{I}_{n,sco}^k=\Phi_n(\bar{s}^{k}) with \bar{s}^{k}=\sum_{l=1}^k s^l /k, using the results stated by (Delyon B., Lavielle M., and Moulines E. 1999). Let us denote by Vect(A) the vector composed of the elements of the upper triangular part of matrix A ordered by columns.
Theorem 2 Assume that (M1') and (M2'), (M3) to (M5), (SAEM1'), (SAEM2), (SAEM3), (SAEM4') and (LOC1) to (LOC3) are fulfilled. Then, there exists a regular stable stationary point \theta^* \in \Theta such that \lim_k \theta_k=\theta^* a.s. Moreover the sequence (\sqrt{k}(Vect(\bar{I}_{n,sco}^k)-Vect(I_{n,sco}(\theta^*))))\mathbb{1}_{\lim \| \theta_k-\theta^*\|=0 } converges in distribution toward a centered Gaussian random vector when k goes to infinity. The asymptotic covariance matrix is characterized in the proof.
Proof. The proof follows the lines of that of Theorem 7 of (Delyon B., Lavielle M., and Moulines E. 1999). Assumptions (LOC1) to (LOC3) are those of (Delyon B., Lavielle M., and Moulines E. 1999) and ensure the existence of a regular stable stationary point s^* for h, and therefore of \theta^*=\hat{\theta}(s^*) for the observed log-likelihood l. Then applying Theorem 4 of (Delyon B., Lavielle M., and Moulines E. 1999), we get that: \sqrt{k}( \bar{s}^k - s^*) \mathbb{1}_{\lim \| s^k-s^*\|=0 } \overset{\mathcal{L}}{ \rightarrow} \mathcal{N}(0, J(s^*)^{-1} \Gamma(s^*) J(s^*)^{-1} )\mathbb{1}_{\lim \| s^k-s^*\|=0 } where \Gamma is the function defined in assumption (SAEM4') and J is the Jacobian matrix of the function h. Applying the Delta method, we get that: \sqrt{k}( Vect(\Phi_n(\bar{s}^k)) - Vect(\Phi_n(s^*))) \mathbb{1}_{\lim \| s^k-s^*\|=0 } \overset{\mathcal{L}}{ \rightarrow} W\mathbb{1}_{\lim \| s^k-s^*\|=0 } where W \sim \mathcal{N}(0, \partial Vect(\Phi_n (s^*)) J(s^*)^{-1} \Gamma(s^*) J(s^*)^{-1} \partial Vect(\Phi_n (s^*))^t ), which leads to the result.
Note that, as usual in stochastic approximation results, the rate \sqrt{k} is achieved when considering an averaged estimator (see e.g. Theorem 7 of (Delyon B., Lavielle M., and Moulines E. 1999)).
3.2.3 Description of the algorithm for general latent variable models
In general settings, the SAEM algorithm can still be applied to approximate numerically the maximum likelihood estimate of the model parameter. Nevertheless, there are no longer theoretical guarantees of convergence for the algorithm. We propose an extended version of our algorithm which provides an estimate of the Fisher information matrix as a by-product of the estimation algorithm.
Initialization step: initialize \Delta_i^0 for all 1 \leq i \leq n, Q_0 and \theta_0 arbitrarily.
Repeat until convergence the three steps defined at iteration k by:
Simulation step: for all 1 \leq i \leq n, simulate a realization Z_i^k from the conditional distribution of Z_i given the observations Y_i, denoted by p_i, using the current parameter value \theta_{k-1}.
Stochastic approximation step: compute Q_{k}(\theta) = (1-\gamma_k)Q_{k-1}(\theta)+\gamma_k \sum_{i=1}^n \log f(y_i,Z_i^k;\theta) and, for all 1 \leq i \leq n, \Delta_i^{k} = (1-\gamma_k)\Delta_i^{k-1} +\gamma_k \partial_\theta \log f(y_i,Z_i^k;\theta_{k-1}).
Maximisation step: update of the parameter estimator according to: \theta_{k} = \argmax_{\theta} Q_{k}(\theta).
When convergence is reached, say at iteration K of the algorithm, evaluate the FIM estimator according to: I_{n,sco}^K = \frac{1}{n} \sum_{i=1}^n \Delta_i^K (\Delta_i^K )^t.
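The new ingredient here is the stochastic approximation of the individual gradients \Delta_i. The sketch below isolates this recursion in a Poisson mixture, with \theta held fixed at the true value for readability (so the maximisation step is skipped); the resulting matrix approximates the estimate given by Equation 3. This is an illustrative sketch only:

```r
## Stochastic approximation of the individual scores Delta_i (theta fixed)
## in a Poisson mixture, where Z_i | Y_i can be simulated exactly.
set.seed(1)
lambda <- c(2, 5, 9); alpha <- c(0.3, 0.5, 0.2)
n <- 200
y <- rpois(n, lambda[sample(1:3, n, replace = TRUE, prob = alpha)])

grad_complete <- function(yi, zi) {   # d/dtheta log f(y_i, z_i; theta)
  g <- numeric(5)                     # theta = (lambda_1..3, alpha_1, alpha_2)
  g[zi] <- yi / lambda[zi] - 1
  if (zi < 3) g[3 + zi] <- 1 / alpha[zi] else g[4:5] <- -1 / alpha[3]
  g
}

Delta <- matrix(0, n, 5)
for (k in 1:3000) {
  gam <- 1 / k                        # this schedule yields a running average
  for (i in 1:n) {
    zi <- sample(1:3, 1, prob = alpha * dpois(y[i], lambda))  # draw Z_i | Y_i
    Delta[i, ] <- (1 - gam) * Delta[i, ] + gam * grad_complete(y[i], zi)
  }
}
I_sco_K <- crossprod(Delta) / n
I_sco_K
```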
We illustrate through simulations in a nonlinear mixed effects model the performance of this algorithm in Section 4.2.
4 Simulation study
In this section, we investigate both the properties of the estimators I_{n,sco}(\theta) and I_{n,obs}(\theta) when the sample size n grows and the properties of the proposed algorithm when the number of iterations grows.
4.1 Asymptotic properties of the estimators I_{n,sco}(\theta) and I_{n,obs}(\theta)
4.1.1 Simulation settings
First we consider the following linear mixed effects model y_{ij} = \beta + z_{i} + \varepsilon_{ij}, where y_{ij} \in \mathbb{R} denotes the j^{th} observation of individual i, 1\leq i \leq n, 1\leq j \leq J, z_i \in \mathbb{R} the unobserved random effect of individual i and \varepsilon_{ij} \in \mathbb{R} the residual term. The random effects (z_{i}) are assumed independent and identically distributed such that z_{i} \underset{i.i.d.}{\sim} \mathcal{N}(0,\eta^2), the residuals (\varepsilon_{ij}) are assumed independent and identically distributed such that \varepsilon_{ij} \underset{i.i.d.}{\sim} \mathcal{N}(0,\sigma^2) and the sequences (z_i) and (\varepsilon_{ij}) are assumed mutually independent. Here, the model parameters are given by \theta = (\beta, \eta^2, \sigma^2). We set \beta=3, \eta^2=2, \sigma^2=5 and J=12.
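In this model the Fisher information matrix is explicit: the marginal law of y_i = (y_{i1},\ldots,y_{iJ}) is \mathcal{N}(\beta \mathbf{1}_J, \Sigma) with \Sigma = \sigma^2 I_J + \eta^2 \mathbf{1}_J\mathbf{1}_J^t, so that I_{\beta\beta} = \mathbf{1}_J^t \Sigma^{-1}\mathbf{1}_J, and the variance components follow the standard Gaussian formula \frac{1}{2}\mathrm{tr}(\Sigma^{-1}\partial_k\Sigma\,\Sigma^{-1}\partial_l\Sigma). A sketch of this computation with the simulation values above (illustrative, not the paper's code):

```r
## Exact per-individual FIM in the linear mixed effects model, from the
## marginal Gaussian law y_i ~ N(beta 1_J, sig2 I + eta2 11') (sketch)
J <- 12; eta2 <- 2; sig2 <- 5
one <- rep(1, J)
Sigma <- sig2 * diag(J) + eta2 * one %*% t(one)
Sinv <- solve(Sigma)
dS <- list(eta2 = one %*% t(one), sig2 = diag(J))  # dSigma w.r.t. eta2, sig2
nm <- c("beta", "eta2", "sig2")
I <- matrix(0, 3, 3, dimnames = list(nm, nm))
I["beta", "beta"] <- drop(t(one) %*% Sinv %*% one) # = J/(sig2 + J*eta2)
for (a in 1:2) for (b in 1:2)                      # 0.5 tr(Sinv dS_a Sinv dS_b)
  I[1 + a, 1 + b] <- sum(diag(Sinv %*% dS[[a]] %*% Sinv %*% dS[[b]])) / 2
I
```

The cross-terms between \beta and the variance components vanish in a Gaussian model, hence the zeros in the first row and column.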
Second we consider the following Poisson mixture model, where the distribution of each observation y_i (1\leq i \leq n) depends on a latent state variable z_i, leading to y_i|z_i=k \sim \mathcal{P}(\lambda_k) with \mathbb{P}(z_i=k) = \alpha_k and \sum_{k=1}^{K} \alpha_k = 1. The model parameters are \theta=(\lambda_1,\ldots,\lambda_K,\alpha_1,\ldots,\alpha_{K-1}). For the simulation study, we consider a mixture of K=3 components and the following values for the parameters: \lambda_1=2, \lambda_2=5, \lambda_3=9, \alpha_1=0.3 and \alpha_2=0.5.
4.1.2 Results
For each model, we generate M=500 datasets for different sample sizes n \in \left\{20,100,500 \right\}. We do not aim at estimating the model parameters. We assume them to be known, and in the following we denote by \theta^{\star} the true parameter value. For each value of n and for each 1 \leq m \leq M, we derive I_{n,sco}^{(m)}(\theta^{\star}) and I_{n,obs}^{(m)}(\theta^{\star}). We compute the empirical bias and the root mean squared deviation of each component (\ell,\ell') of the estimated matrix as: \frac{1}{M} \sum\limits_{m=1}^{M} I_{n,sco,\ell,\ell'}^{(m)}(\theta^\star) - I_{\ell,\ell'}(\theta^\star) \; \; \; \mathrm{and} \; \; \; \sqrt{\frac{1}{M} \sum\limits_{m=1}^{M} \left(I_{n,sco,\ell,\ell'}^{(m)}(\theta^\star) - I_{\ell,\ell'}(\theta^\star)\right)^2}.
In the previous quantities, I(\theta^\star) is explicit in the linear mixed effects model and approximated by a Monte Carlo estimation based on a large sample size in the Poisson mixture model. The results are presented in ?@tbl-IscoLMM and ?@tbl-IobsLMM for the linear mixed effects model and in ?@tbl-IscoMixt and ?@tbl-IobsMixt for the mixture model. We observe that whatever the model and whatever the components of I_{n,sco}(\theta^{\star}) and I_{n,obs}(\theta^{\star}), the bias is very small even for small values of n. Note that in the linear mixed effects model the second derivative with respect to parameter \beta is deterministic, which explains why the bias and the dispersion of the estimates I_{n,obs}(\theta^{\star}) are zero for this component for every value of n. The bias and the standard deviation decrease overall as n increases, which illustrates the consistency of both estimators. The distributions of the normalized estimates \sqrt{n} \left(I_{n,sco}^{(m)}(\theta^\star) - I(\theta^\star)\right) and \sqrt{n} \left(I_{n,obs}^{(m)}(\theta^\star) - I(\theta^\star)\right) are also represented when n=500 for some components of the matrices in Figure 1 (linear mixed effects model) and ?@fig-simuPoissonMixture (Poisson mixture model). The empirical distributions have the shape of Gaussian distributions and illustrate the asymptotic normality of the two estimators. The numerical results highlight that neither I_{n,sco}(\theta^{\star}) nor I_{n,obs}(\theta^{\star}) is systematically better than the other in terms of bias and asymptotic covariance matrix. In the same model, different behaviours can be observed depending on the components of the parameter vector.
## Monte-Carlo estimation of the true Fisher information matrix based on a large sample
source('fonctions/sim_poisson_mixture.R')
nMC <- 100000000
alpha <- c(0.3,0.5) # mixture weights of the first K-1 components
lambda <- c(2,5,9) # parameter values of the K Poisson distribution of the mixture
y <- sim_poisson_mixture(nMC,lambda,alpha)
trueFIM <- fisher_estimation_poisson_mixture(y, nMC, lambda, alpha)
trueFIM$Iobs
trueFIM$Isco
trueFIM <- (trueFIM$Isco+trueFIM$Iobs)/2
save(trueFIM,file='Rfiles/PoissonMixtureTrueFIM.Rdata')
### Numerical study of the statistical properties of the two estimators of the
### Fisher Information matrix in the Poisson mixture model
### ---------------------------------------------------------------------------
nbsim <- 500 # number of replicates
alpha <- c(0.3,0.5) # mixture weights of the first K-1 components
lambda <- c(2,5,9) # parameter values of the K Poisson distribution of the mixture
seq.n <- c(20,100,500) # sample size
Iobs.theta.est <- array(NA,dim=c(5,5,nbsim))
Isco.theta.est <- array(NA,dim=c(5,5,nbsim))
est.lambda <- matrix(NA,3,nbsim)
est.alpha <- matrix(NA,2,nbsim)
for (n in seq.n){
for (j in 1:nbsim){
## Data simulation
y <- sim_poisson_mixture(n,lambda,alpha)
## Parameter estimation
em.est <- em_poisson_mixture(y,3)
est.lambda[,j] <- em.est[[1]]
est.alpha[,j] <- em.est[[2]]
## Computation of Isco and Iobs in the MLE value of the parameter
res.theta.est <- fisher_estimation_poisson_mixture(y, est.lambda[,j],
est.alpha[,j])
Iobs.theta.est[,,j] <- res.theta.est$Iobs
Isco.theta.est[,,j] <- res.theta.est$Isco
}
ResSim <- list(n=n,Isco=Isco.theta.est,Iobs=Iobs.theta.est,lambda=lambda,
alpha=alpha)
filename <- paste('Rfiles/simusMixt_n',n,'.Rdata',sep="")
save(ResSim,file=filename)
}
load('Rfiles/PoissonMixtureTrueFIM.Rdata')
load("Rfiles/simusMixt_n20.Rdata")
biasIscoMixt20 <- round(apply(ResSim$Isco,c(1,2),mean) - trueFIM,5)
sdIscoMixt20 <- round(apply(ResSim$Isco,c(1,2),sd),5)
biasIobsMixt20 <- round(apply(ResSim$Iobs,c(1,2),mean) - trueFIM,5)
sdIobsMixt20 <- round(apply(ResSim$Iobs,c(1,2),sd),5)
load("Rfiles/simusMixt_n100.Rdata")
biasIscoMixt100 <- round(apply(ResSim$Isco,c(1,2),mean) - trueFIM,5)
sdIscoMixt100 <- round(apply(ResSim$Isco,c(1,2),sd),5)
biasIobsMixt100 <- round(apply(ResSim$Iobs,c(1,2),mean) - trueFIM,5)
sdIobsMixt100 <- round(apply(ResSim$Iobs,c(1,2),sd),5)
load("Rfiles/simusMixt_n500.Rdata")
biasIscoMixt500 <- round(apply(ResSim$Isco,c(1,2),mean) - trueFIM,5)
sdIscoMixt500 <- round(apply(ResSim$Isco,c(1,2),sd),5)
biasIobsMixt500 <- round(apply(ResSim$Iobs,c(1,2),mean) - trueFIM,5)
sdIobsMixt500 <- round(apply(ResSim$Iobs,c(1,2),sd),5)
dataBiasMixt <- cbind(component = c(1,2,3,4,5,6),
mIsco20 = c(diag(biasIscoMixt20)[2:5],biasIscoMixt20[2,3],biasIscoMixt20[3,5]),
mIobs20 = c(diag(biasIobsMixt20)[2:5],biasIobsMixt20[2,3],biasIobsMixt20[3,5]),
mIsco100 =c(diag(biasIscoMixt100)[2:5],biasIscoMixt100[2,3],biasIscoMixt100[3,5]),
mIobs100 =c(diag(biasIobsMixt100)[2:5],biasIobsMixt100[2,3],biasIobsMixt100[3,5]),
mIsco500 =c(diag(biasIscoMixt500)[2:5],biasIscoMixt500[2,3],biasIscoMixt500[3,5]),
mIobs500 =c(diag(biasIobsMixt500)[2:5],biasIobsMixt500[2,3],biasIobsMixt500[3,5]))
dataBiasMixt <- as.data.frame(dataBiasMixt)
dataBiasMixt$component <-factor(dataBiasMixt$component, levels=c(1,2,3,4,5,6),
labels=c("1"='(λ2,λ2)',
"2"='(λ3,λ3)',
"3"='(α1,α1)',
"4"='(α2,α2)',
"5"='(λ2,λ3)',
"6"='(λ3,α2)'))
dataBiasMixt <- flextable(dataBiasMixt, cwidth=1.2)
dataBiasMixt <- add_header_row(
x = dataBiasMixt, values = c("", "n=20", "n=100", "n=500"),
colwidths = c(1, 2, 2, 2))
dataBiasMixt <- set_header_labels(dataBiasMixt, mIsco20="Isco", mIobs20="Iobs",
mIsco100="Isco", mIobs100="Iobs", mIsco500="Isco",
mIobs500="Iobs", component="")
dataBiasMixt <- align(dataBiasMixt, part = "all", align = "center")
dataBiasMixt

| Component | Isco (n=20) | Iobs (n=20) | Isco (n=100) | Iobs (n=100) | Isco (n=500) | Iobs (n=500) |
|---|---|---|---|---|---|---|
| (λ2,λ2) | 0.00009 | -0.00025 | -0.00003 | 0.00025 | 0.00008 | -0.00029 |
| (λ3,λ3) | 0.00005 | 0.00047 | -0.00023 | -0.00035 | 0.00008 | 0.00015 |
| (α1,α1) | 0.05981 | 0.05981 | -0.04072 | -0.04072 | 0.01849 | 0.01849 |
| (α2,α2) | 0.04756 | 0.04756 | -0.04006 | -0.04006 | 0.01205 | 0.01205 |
| (λ2,λ3) | 0.00009 | 0.00009 | -0.00007 | -0.00007 | 0.00002 | 0.00002 |
| (λ3,α2) | -0.00220 | 0.00082 | 0.00284 | -0.00061 | -0.00077 | 0.00041 |
dataSdMixt <- cbind(component = c(1,2,3,4,5,6),
sIsco20 = c(diag(sdIscoMixt20)[2:5],sdIscoMixt20[2,3],sdIscoMixt20[3,5]),
sIobs20 = c(diag(sdIobsMixt20)[2:5],sdIobsMixt20[2,3],sdIobsMixt20[3,5]),
sIsco100 =c(diag(sdIscoMixt100)[2:5],sdIscoMixt100[2,3],sdIscoMixt100[3,5]),
sIobs100 =c(diag(sdIobsMixt100)[2:5],sdIobsMixt100[2,3],sdIobsMixt100[3,5]),
sIsco500 =c(diag(sdIscoMixt500)[2:5],sdIscoMixt500[2,3],sdIscoMixt500[3,5]),
sIobs500 =c(diag(sdIobsMixt500)[2:5],sdIobsMixt500[2,3],sdIobsMixt500[3,5]))
dataSdMixt <- as.data.frame(dataSdMixt)
dataSdMixt$component <-factor(dataSdMixt$component, levels=c(1,2,3,4,5,6),
labels=c("1"='(λ2,λ2)',
"2"='(λ3,λ3)',
"3"='(α1,α1)',
"4"='(α2,α2)',
"5"='(λ2,λ3)',
"6"='(λ3,α2)'))
dataSdMixt <- flextable(dataSdMixt, cwidth=1.2)
dataSdMixt <- add_header_row(
x = dataSdMixt, values = c("", "n=20", "n=100", "n=500"),
colwidths = c(1, 2, 2, 2))
dataSdMixt <- set_header_labels(dataSdMixt, sIsco20="Isco", sIobs20="Iobs",
sIsco100="Isco", sIobs100="Iobs", sIsco500="Isco",
sIobs500="Iobs", component="")
dataSdMixt <- align(dataSdMixt, part = "all", align = "center")
dataSdMixt

| Component | Isco (n=20) | Iobs (n=20) | Isco (n=100) | Iobs (n=100) | Isco (n=500) | Iobs (n=500) |
|---|---|---|---|---|---|---|
| (λ2,λ2) | 0.00717 | 0.02238 | 0.00310 | 0.00996 | 0.00141 | 0.00463 |
| (λ3,λ3) | 0.01523 | 0.00872 | 0.00664 | 0.00403 | 0.00299 | 0.00167 |
| (α1,α1) | 1.20192 | 1.20192 | 0.52483 | 0.52483 | 0.23129 | 0.23129 |
| (α2,α2) | 1.05566 | 1.05566 | 0.46762 | 0.46762 | 0.20510 | 0.20510 |
| (λ2,λ3) | 0.00295 | 0.00295 | 0.00132 | 0.00132 | 0.00059 | 0.00059 |
| (λ3,α2) | 0.11013 | 0.03428 | 0.04614 | 0.01561 | 0.02137 | 0.00712 |
4.2 Asymptotic properties of the stochastic approximation algorithm
4.2.1 In curved exponential family model
We consider the following nonlinear mixed effects model which is widely used in pharmacokinetics for describing the evolution of drug concentration over time: y_{ij}=\frac{d_i ka_{i}}{V_i ka_{i}-Cl_{i}}\left[e^{-\frac{Cl_{i}}{V_i} t_{ij}} - e^{-ka_{i} t_{ij}}\right] + \varepsilon_{ij}, \tag{5} where ka_i, Cl_i and V_i are individual random parameters such that \log ka_{i} = \log(ka) + z_{i,1}, \log Cl_{i} = \log(Cl) + z_{i,2}, \log V_i = \log(V) + z_{i,3}. For all 1 \leq i \leq n, 1\leq j \leq J, y_{ij} denotes the measure of drug concentration on individual i at time t_{ij}, d_i the dose of drug administered to individual i, and V_i, ka_i and Cl_i respectively denote the volume of the central compartment, the drug’s absorption rate constant and the drug’s clearance of individual i. The terms z_{i} = (z_{i,1},z_{i,2},z_{i,3})' \in \mathbb{R}^3 are unobserved random effects which are assumed independent and identically distributed such that z_i \underset{i.i.d.}{\sim} \mathcal{N}(0,\Omega), where \Omega = \mathrm{diag}(\omega^2_{ka},\omega^2_{Cl},\omega^2_{V}), the residuals (\varepsilon_{ij}) are assumed independent and identically distributed such that \varepsilon_{ij} \underset{i.i.d.}{\sim} \mathcal{N}(0,\sigma^2) and the sequences (z_i) and (\varepsilon_{ij}) are assumed mutually independent. Here, the model parameter is given by \theta = (ka,V,Cl,\omega^2_{ka},\omega^2_{V},\omega^2_{Cl},\sigma^2). In this model, as in a large majority of nonlinear mixed effects models, the likelihood does not have any analytical expression. As a consequence, neither the Fisher Information Matrix, nor the estimators I_{n,sco}(\theta), I_{n,obs}(\theta) have explicit expressions. However, as the complete data log-likelihood is explicit, stochastic approximations of I_{n,sco}(\theta), I_{n,obs}(\theta) can be implemented. We take the following values for the parameters V=31, ka=1.6, Cl=2.8, \omega^2_V=0.40, \omega^2_{ka}=0.40, \omega^2_{Cl}=0.40 and \sigma^2=0.75. 
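Since the model is straightforward to simulate from even though its likelihood is intractable, the following minimal sketch generates one dataset under Equation 5; the parameter values and observation design are those given in the text, while all variable names are our own.

```r
## Simulating one dataset under model (5); parameter values from the text,
## variable names are ours.
set.seed(1)
n <- 100                                          # number of individuals
d <- 320                                          # administered dose
times <- c(0.25, 0.5, 1, 2, 3.5, 5, 7, 9, 12, 24) # observation times (hours)
ka <- 1.6; Cl <- 2.8; V <- 31
omega2 <- c(ka = 0.40, Cl = 0.40, V = 0.40)
sigma2 <- 0.75
## individual parameters: log-normal around the population values
ka_i <- ka * exp(rnorm(n, 0, sqrt(omega2["ka"])))
Cl_i <- Cl * exp(rnorm(n, 0, sqrt(omega2["Cl"])))
V_i  <- V  * exp(rnorm(n, 0, sqrt(omega2["V"])))
## n x J matrix of noisy concentration measurements
y <- t(sapply(1:n, function(i) {
  pred <- d * ka_i[i] / (V_i[i] * ka_i[i] - Cl_i[i]) *
    (exp(-Cl_i[i] / V_i[i] * times) - exp(-ka_i[i] * times))
  pred + rnorm(length(times), 0, sqrt(sigma2))
}))
```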
We consider the same dose d_i=320 and the same observation times (in hours): 0.25, 0.5, 1, 2, 3.5, 5, 7, 9, 12, 24 for all the individuals. We simulate one dataset with n=100 individuals under the model specified by Equation 5. On this simulated dataset, we run M=500 times the stochastic approximation algorithm described in Section 3.2.1 for computing I_{n,sco}(\hat{\theta}) together with \hat{\theta}, and the algorithm of (Delyon B., Lavielle M., and Moulines E. 1999) for computing I_{n,obs}(\hat{\theta}). We perform K=3000 iterations in total for each algorithm by setting \gamma_k=0.95 for 1 \leq k \leq 1000 (burn-in iterations) and \gamma_k=(k-1000)^{-3/5} otherwise. At each iteration, we compute the empirical relative bias and the empirical relative standard deviation of each component (\ell,\ell') of I_{n,sco}, defined respectively as: \frac{1}{M} \sum\limits_{m=1}^{M} \frac{\widehat{I_{n,sco,\ell,\ell'}^{(k,m)}} - I_{n,sco,\ell,\ell'}^{\star}}{I_{n,sco,\ell,\ell'}^{\star}} \; \; \; \mathrm{and} \; \; \; \sqrt{\frac{1}{M} \sum\limits_{m=1}^{M} \left(\frac{\widehat{I_{n,sco,\ell,\ell'}^{(k,m)}} - I_{n,sco,\ell,\ell'}^{\star}}{I_{n,sco,\ell,\ell'}^{\star}} \right)^2} where \widehat{I_{n,sco}^{(k,m)}} denotes the estimated value of I_{n,sco}(\hat{\theta}) at iteration k of the m^{th} run of the algorithm. We compute the same quantities for I_{n,obs}. As the true values of I_{n,sco}^{\star}=I_{n,sco}(\theta^{\star}) and I_{n,obs}^{\star}=I_{n,obs}(\theta^{\star}) are not known, they are estimated by Monte-Carlo integration. The results are displayed in Figure 2 and Figure 3.
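For a single component (\ell,\ell'), the two Monte-Carlo criteria above reduce to one-liners; in this sketch, `estimates` (the M replicated values) and `ref` (the Monte-Carlo reference value) are hypothetical names of our own.

```r
## Empirical relative bias and relative standard deviation of M replicated
## estimates of one matrix component, relative to a reference value `ref`
## (function and argument names are ours).
relative_bias <- function(estimates, ref) mean((estimates - ref) / ref)
relative_sd   <- function(estimates, ref) sqrt(mean(((estimates - ref) / ref)^2))

relative_bias(c(1.1, 0.9), 1)  # ~0: errors of +10% and -10% cancel
relative_sd(c(1.1, 0.9), 1)    # ~0.1: but their spread does not
```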
We observe that the bias and the standard deviations of the estimates of the components of both matrices decrease over iterations, and that for both estimates the bias is nearly zero when the convergence of the algorithm is reached. According to these simulation results, there is no evidence that one method is better than the other in terms of bias or standard deviation.
4.2.2 In general latent variable model
We use the model specified by Equation 5 again, but we now consider that the individual parameter V_i is fixed, i.e. V_i \equiv V for all i = 1,\ldots,n. The model is no longer exponential in the sense of Equation 4. We must therefore use the general version of the stochastic approximation algorithm from Section 3.2.3 to compute I_{n,sco}(\hat{\theta}). We simulate 500 datasets according to this model and we estimate I_{n,sco}(\hat{\theta}) and \hat{\theta} for each one. We perform K=3000 iterations of the algorithm by setting \gamma_k=k^{-0.501}. We compute the 500 asymptotic confidence intervals of the model parameters [\hat{\theta}^{(\ell)}_k - q_{1-\alpha/2} \, \hat{\sigma}^{(\ell)}_k , \hat{\theta}^{(\ell)}_k + q_{1-\alpha/2} \, \hat{\sigma}^{(\ell)}_k], \ell =1,\ldots,6 and then deduce from them empirical coverage rates. The \hat{\sigma}^{(\ell)}_k’s are obtained from the diagonal terms of the inverted matrices V_n(\hat{\theta}_k), and q_{1-\alpha/2} stands for the quantile of order 1-\alpha/2 of the standard Gaussian distribution. We obtain for the six parameters (ka,V,Cl,\omega^2_{ka},\omega^2_{Cl},\sigma^2) empirical coverage rates of 0.946, 0.928, 0.962, 0.944, 0.950, 0.942 respectively, for a nominal coverage rate of 0.95. This highlights that our estimate accurately quantifies the precision of the parameter estimates. Convergence graphs obtained from a simulated dataset are shown in Figure 4. Although theoretical guarantees are missing in non-exponential models, the stochastic approximation algorithm proposed in Section 3.2.3 converges in practice on this example, both for the estimation of the model parameters and for the estimation of the Fisher information matrix.
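The interval construction above can be sketched as follows, with V_n denoting the estimated information matrix as in the text; the function name is ours.

```r
## Wald confidence intervals for all parameters from an estimated Fisher
## information matrix V_n and a parameter estimate theta_hat (names are ours).
wald_ci <- function(theta_hat, V_n, level = 0.95) {
  se <- sqrt(diag(solve(V_n)))      # standard errors from the inverted matrix
  q  <- qnorm(1 - (1 - level) / 2)  # standard Gaussian quantile
  cbind(lower = theta_hat - q * se, upper = theta_hat + q * se)
}

## with the identity as information matrix, the interval is approximately +/- 1.96
wald_ci(c(0, 0), diag(2))
```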
4.2.3 Comparison with other methods
To the best of our knowledge, although there exist contributions focusing on the estimation of the Fisher information matrix in latent variable models, there is currently no method based on the first derivatives of the log-likelihood. We compare our approach with that of (Meng L. and Spall J. C. 2017), who proposed an iterative method based on numerical first-order derivatives of the Q function computed at each E-step of the EM algorithm. The model used by (Meng L. and Spall J. C. 2017) in their simulation study is a mixture of two Gaussian distributions with unknown expectations \mu_1 and \mu_2, fixed variances equal to 1 and unknown proportion \pi. The model parameters are denoted by \theta=(\pi,\mu_1,\mu_2).
We simulate 10000 datasets according to this Gaussian mixture model, using the same setting as (Meng L. and Spall J. C. 2017), i.e. n=750, \pi=2/3, \mu_1=3 and \mu_2=0. For each dataset k=1,\ldots,10000, we compute the maximum likelihood estimate \hat{\theta}_k = (\hat{\pi}_k,\widehat{\mu_1}_k,\widehat{\mu_2}_k) of the parameters with an EM algorithm and then we derive I_{n,sco}(\hat{\theta}_k) directly according to Equation 3, in contrast to (Meng L. and Spall J. C. 2017) who used an iterative method. We compute the empirical mean of the 10000 estimated matrices, leading to:
\frac1{10000} \sum _k I_{n,sco}(\hat{\theta}_k)= \begin{pmatrix} 2685.184 & -211.068 & -251.808\\ -211.068 & 170.927 & -61.578 \\ -251.808 & -61.578 & 392.859\\ \end{pmatrix}.
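In this model, the individual scores are available in closed form, so the sum of their outer products can be evaluated directly; the sketch below does this for one dataset, ordering the parameters as (\pi,\mu_1,\mu_2), with a function name and test values of our own.

```r
## Direct evaluation of the sum of outer products of the individual scores
## for the two-component Gaussian mixture with unit variances; parameter
## ordering is (pi, mu1, mu2) and the function name is ours.
score_outer_gaussian_mixture <- function(y, p, mu1, mu2) {
  f1 <- dnorm(y, mu1)
  f2 <- dnorm(y, mu2)
  f  <- p * f1 + (1 - p) * f2                    # mixture density at each y_i
  scores <- cbind((f1 - f2) / f,                 # d log f / d pi
                  p * f1 * (y - mu1) / f,        # d log f / d mu1
                  (1 - p) * f2 * (y - mu2) / f)  # d log f / d mu2
  crossprod(scores)                              # sum over i of s_i s_i^t
}

set.seed(1)
y <- c(rnorm(500, 3), rnorm(250, 0))  # one dataset with n = 750, pi = 2/3
score_outer_gaussian_mixture(y, 2/3, 3, 0)
```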
Comparison with the results of (Meng L. and Spall J. C. 2017) is delicate since the numerical illustration of their method is based on a single simulated dataset and is thus potentially sensitive to sampling variations. However, they provide the following estimate of the Fisher information matrix computed from this unique dataset:
I_{Meng} = \begin{pmatrix}
2591.3 & -237.9 & -231.8\\
-237.9 & 155.8 & -86.7\\
-231.8 & -86.7 & 394.5
\end{pmatrix}.
Our results are consistent with theirs. To check the reliability of our results, we then compute as above the 10000 asymptotic confidence intervals of the three model parameters. We obtain for the three parameters (\pi,\mu_1,\mu_2) empirical coverage rates of 0.953, 0.949, 0.951 respectively, for a nominal coverage rate of 0.95. Thus I_{n,sco} accurately quantifies the precision of the parameter estimates.
5 Conclusion and discussion
In this work, we address the estimation of the Fisher information matrix in general latent variable models. We focus on the empirical Fisher information matrix, which is a moment estimate of the covariance matrix of the score. We propose a stochastic approximation algorithm to compute this estimate when it can not be calculated analytically and establish its theoretical convergence properties. We carry out a simulation study in a mixed effects model and in a Poisson mixture model to compare the performance of several estimates, namely the considered empirical Fisher information matrix and the observed Fisher information matrix. We apply our methodology to real data fitted with a nonlinear mixed effects model. We emphasize that the empirical FIM requires fewer regularity assumptions than the observed FIM. From a computational point of view, the implementation of the algorithm for evaluating the empirical FIM only involves the first derivatives of the log-likelihood, contrary to the one for evaluating the observed FIM, which involves the second derivatives.
The main perspective of this work is to adapt the procedure to statistical models in which the derivatives of the log-likelihood have no tractable expressions, by coupling the algorithm with numerical differentiation procedures.
6 Appendix
It is assumed that the random variables s_0, z_1, z_2, \cdots are defined on the same probability space (\Omega, \mathcal{A}, P). We denote by \mathcal{F} = \{ \mathcal{F}_k \}_{k \geq 0} the increasing family of \sigma-algebras generated by the random variables s_0, z_1, z_2, \cdots, z_k. We assume the following conditions:
(M1’) The parameter space \Theta is an open subset of \mathbb{R}^{p}. The individual complete data likelihood function is given for all i=1,\ldots,n by: f_i(z_i;\theta) = \exp\left(-\psi_i(\theta) + \left<S_i(z_i),\phi_i(\theta)\right>\right), where \left<\cdot,\cdot\right> denotes the scalar product and S_i is a Borel function on \mathbb{R}^{d_i} taking its values in an open subset \mathcal{S}_i of \mathbb{R}^{m_i}. Moreover, the convex hull of S(\mathbb{R}^{\sum d_i}) is included in \mathcal{S} and, for all \theta \in \Theta, \int S(z) \prod_i p_i(z_i;\theta) \mu(dz) < \infty.
(M2’) Define for each i the function L_i : \mathcal{S}_i \times \Theta \to \mathbb{R} as L_i(s_i; \theta) \triangleq -\psi_i(\theta) + \left<s_i,\phi_i(\theta)\right>. The functions \psi_i and \phi_i are twice continuously differentiable on \Theta.
(M3) The function \bar{s} : \Theta \rightarrow \mathcal{S} defined as \bar{s}(\theta) \triangleq \int S(z) p(z; \theta) \mu(dz) is continuously differentiable on \Theta.
(M4) The function l:\Theta \rightarrow \mathbb{R} defined as l(\theta) \triangleq \log g(\theta) = \log \int_{\mathbb{R}^{d_z}} f(z;\theta) \mu(dz) is continuously differentiable on \Theta and \partial_\theta \int f(z; \theta) \mu(dz)= \int \partial_\theta f(z; \theta) \mu(dz).
(M5) There exists a continuously differentiable function \widehat{\theta} : \ \mathcal{S} \rightarrow \Theta, such that: \forall s \in \mathcal{S}, \ \ \forall \theta \in \Theta, \ \ L(s; \widehat{\theta}(s))\geq L(s; \theta).
In addition, we define:
(SAEM1) For all k in \mathbb{N}, \gamma_k \in [0,1], \sum_{k=1}^\infty \gamma_k = \infty and \sum_{k=1}^\infty \gamma_k^2 < \infty.
(SAEM2) l:\Theta \rightarrow \mathbb{R} and \widehat{\theta} : \mathcal{S} \rightarrow \Theta are m times differentiable, where m is the integer such that \mathcal{S} is an open subset of \mathbb{R}^m.
(SAEM3) For all positive Borel functions \Phi, E[ \Phi( z_{k+1}) | \mathcal{F}_k ] = \int \Phi( z ) p ( z; \theta_k) \mu( dz).
(SAEM4) For all \theta \in \Theta, \int \| S(z) \|^2 p( z; \theta) \mu (d z) < \infty, and the function \begin{split} \Gamma(\theta) \triangleq \mathrm{Cov}_\theta [S(z)]\triangleq &\int S(z)^t S(z) p(z;\theta)\mu(dz)\\ &-\left[\int S(z)p(z;\theta)\mu(dz)\right]^t\left[\int S(z)p(z;\theta)\mu(dz)\right] \end{split} is continuous w.r.t. \theta.
We also define assumptions required for the normality result:
(SAEM1’) For all k in \mathbb{N}, \gamma_k \in [0,1], \sum_{k=1}^\infty \gamma_k = \infty and \sum_{k=1}^\infty \gamma_k^2 < \infty. There exists \gamma^* such that \lim_{k \to \infty} k^\alpha / \gamma_k = \gamma^*, and \gamma_k / \gamma_{k+1} = 1 + O(k^{-1}).
(SAEM4’) For some \alpha>0, \sup_\theta \mathrm{E}_\theta(\|S(Z)\|^{2+\alpha})< \infty and \Gamma is continuous w.r.t. \theta.
(LOC1) The stationary points of l are isolated: any compact subset of \Theta contains only a finite number of such points.
(LOC2) For every stationary point \theta^*, the matrices \mathrm{E}_{\theta^*}(\partial_\theta L(S(Z),\theta^*) (\partial_\theta L(S(Z),\theta^*))^t) and \partial_\theta^2 L(\mathrm{E}_{\theta^*} (S(Z)),\theta^*) are positive definite.
(LOC3) The minimum eigenvalue of the covariance matrix R(\theta)=\mathrm{E}_\theta((S(Z)-\bar{s}(\theta))(S(Z)-\bar{s}(\theta))^t) is bounded away from zero for \theta in any compact subset of \Theta.
6.1 R functions
6.1.1 R function for Fisher information matrix estimation in the Poisson mixture model
## Function for computing Isco and Iobs for Fisher information matrix estimation
## in Poisson mixture models
fisher_estimation_poisson_mixture <- function(y, lambda, alpha) {
  # y : vector of observations
  # lambda : vector of K Poisson parameters for each component of the mixture
  #          (in ascending order)
  # alpha : vector of (K-1) mixture proportions
  K <- length(lambda)  # number of components of the mixture
  n <- length(y)       # sample size
  deriv1ind <- matrix(0, 2*K-1, n)
  deriv2 <- matrix(0, 2*K-1, 2*K-1)
  covderiv <- matrix(0, 2*K-1, 2*K-1)
  ## computation of conditional expectation of the first derivatives of the
  ## complete data log-likelihood
  denom <- 0
  for (k in 1:(K-1)) {
    denom <- denom + exp(-lambda[k])*lambda[k]^y*alpha[k]
  }
  denom <- denom + exp(-lambda[K])*lambda[K]^y*(1-sum(alpha))
  for (k in 1:(K-1)) {
    deriv1ind[k,] <- exp(-lambda[k])*lambda[k]^y*alpha[k]/denom*(y/lambda[k]-1)
    deriv1ind[K+k,] <- exp(-lambda[k])*lambda[k]^y/denom -
      exp(-lambda[K])*lambda[K]^y/denom
  }
  deriv1ind[K,] <- exp(-lambda[K])*lambda[K]^y*(1-sum(alpha))/denom*(y/lambda[K]-1)
  ## computation of conditional expectation of the second derivatives of the
  ## complete data log-likelihood
  for (k in 1:(K-1)) {
    deriv2[k,k] <- sum(exp(-lambda[k])*lambda[k]^y*alpha[k]/denom*(-y/lambda[k]^2))
    deriv2[K+k,K+k] <- sum(-exp(-lambda[k])*lambda[k]^y/denom/alpha[k] -
      exp(-lambda[K])*lambda[K]^y/denom*(1/(1-sum(alpha))))
  }
  for (k in seq_len(K-2)) {  # seq_len() avoids the 1:0 pitfall when K = 2
    for (l in (k+1):(K-1)) {
      deriv2[K+k,K+l] <- -sum(exp(-lambda[K])*lambda[K]^y/denom*(1/(1-sum(alpha))))
      deriv2[K+l,K+k] <- deriv2[K+k,K+l]
    }
  }
  deriv2[K,K] <- sum(exp(-lambda[K])*lambda[K]^y*(1-sum(alpha))/denom*(-y/lambda[K]^2))
  ## computation of the conditional covariance matrix of the first derivatives of
  ## the complete data log-likelihood
  for (k in seq_len(K-2)) {  # seq_len() avoids the 1:0 pitfall when K = 2
    covderiv[k,k] <- sum(exp(-lambda[k])*lambda[k]^y*alpha[k]/denom*(-1+y/lambda[k])^2)
    covderiv[k+K,k+K] <- sum(exp(-lambda[k])*lambda[k]^y/denom/alpha[k] +
      exp(-lambda[K])*lambda[K]^y/denom/(1-sum(alpha)))
    for (l in (k+1):(K-1)) {
      covderiv[k+K,l+K] <- sum(exp(-lambda[K])*lambda[K]^y/denom/(1-sum(alpha)))
      covderiv[l+K,k+K] <- covderiv[k+K,l+K]
    }
    covderiv[k,K+k] <- sum(exp(-lambda[k])*lambda[k]^y/denom*(-1+y/lambda[k]))
    covderiv[k+K,k] <- covderiv[k,K+k]
    covderiv[K,K+k] <- sum(exp(-lambda[K])*lambda[K]^y/denom*(-1+y/lambda[K])*(-1))
    covderiv[K+k,K] <- covderiv[K,K+k]
  }
  covderiv[K-1,K-1] <- sum(exp(-lambda[K-1])*lambda[K-1]^y*alpha[K-1]/denom*(-1+y/lambda[K-1])^2)
  covderiv[2*K-1,2*K-1] <- sum(exp(-lambda[K-1])*lambda[K-1]^y/denom/alpha[K-1] +
    exp(-lambda[K])*lambda[K]^y/denom/(1-sum(alpha)))
  covderiv[K-1,2*K-1] <- sum(exp(-lambda[K-1])*lambda[K-1]^y/denom*(-1+y/lambda[K-1]))
  covderiv[2*K-1,K-1] <- covderiv[K-1,2*K-1]
  covderiv[K,2*K-1] <- sum(exp(-lambda[K])*lambda[K]^y/denom*(-1+y/lambda[K])*(-1))
  covderiv[2*K-1,K] <- covderiv[K,2*K-1]
  covderiv[K,K] <- sum(exp(-lambda[K])*lambda[K]^y*(1-sum(alpha))/denom*(-1+y/lambda[K])^2)
  ## computation of Isco and Iobs
  Isco <- deriv1ind %*% t(deriv1ind)/n
  Iobs <- deriv1ind %*% t(deriv1ind)/n - deriv2/n - covderiv/n
  res <- list(Isco = Isco, Iobs = Iobs)
  return(res)
}
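By the Fisher identity, the conditional expectations stored in `deriv1ind` above coincide with the individual scores of the observed log-likelihood, which suggests a simple self-contained cross-check of Isco via finite differences; the data, parameter values and all names below are ours, and in practice one would compare the result with `fisher_estimation_poisson_mixture` called on the same data.

```r
## Self-contained cross-check: by the Fisher identity, the rows of deriv1ind
## equal the individual scores of the observed log-likelihood, so Isco can
## also be approximated by finite differences (all names and values are ours).
loglik_ind <- function(theta, y) {
  lambda <- theta[1:3]                          # K = 3 Poisson means
  alpha  <- c(theta[4:5], 1 - sum(theta[4:5]))  # all K mixture proportions
  log(sapply(y, function(yi) sum(alpha * dpois(yi, lambda))))
}
set.seed(1)
comp <- sample(1:3, 200, replace = TRUE, prob = c(0.3, 0.4, 0.3))
y <- rpois(200, c(2, 5, 9)[comp])
theta <- c(2, 5, 9, 0.3, 0.4)
eps <- 1e-6
scores <- sapply(1:5, function(j) {  # central differences, parameter j
  tp <- theta; tp[j] <- tp[j] + eps
  tm <- theta; tm[j] <- tm[j] - eps
  (loglik_ind(tp, y) - loglik_ind(tm, y)) / (2 * eps)
})
Isco_fd <- crossprod(scores) / length(y)  # n^{-1} sum of s_i s_i^t
```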
Citation
@article{delattre2023,
author = {Maud Delattre and Estelle Kuhn},
title = {Computing an Empirical {Fisher} Information Matrix Estimate
in Latent Variable Models Through Stochastic Approximation},
journal = {Computo},
date = {2023-04-23},
url = {https://computo.sfds.asso.fr/template-computo-quarto},
doi = {xxxx},
langid = {en},
abstract = {The Fisher information matrix (FIM) is a key quantity in
statistics. However its exact computation is often not trivial. In
particular in many latent variable models, it is intricated due to
the presence of unobserved variables. Several methods have been
proposed to approximate the FIM when it can not be evaluated
analytically. Different estimates have been considered, in
particular moment estimates. However some of them require to compute
second derivatives of the complete data log-likelihood which leads
to some disadvantages. In this paper, we focus on the empirical
Fisher information matrix defined as an empirical estimate of the
covariance matrix of the score, which only requires to compute the
first derivatives of the log-likelihood. Our contribution consists
in presenting a new numerical method to evaluate this empirical
Fisher information matrix in latent variable model when the proposed
estimate can not be directly analytically evaluated. We propose a
stochastic approximation estimation algorithm to compute this
estimate as a by-product of the parameter estimate. We evaluate the
finite sample size properties of the proposed estimate and the
convergence properties of the estimation algorithm through
simulation studies.}
}